
(ICCV 2017) Identity-aware textual-visual matching with latent co-attention

Li S, Xiao T, Li H, et al. Identity-aware textual-visual matching with latent co-attention[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1890-1899.



1. Overview


1.1. Motivation

  • most existing methods tackle the textual-visual matching problem without effectively utilizing identity-level annotations
  • RNNs have difficulty remembering the complete sequential information of very long sentences
  • RNNs are sensitive to variations in sentence structure

To address these issues, the paper proposes an identity-aware two-stage framework.

  • stage-1. learns a cross-modal feature embedding with a Cross-Modal Cross-Entropy (CMCE) loss; provides the initial point for training stage-2; screens out easy incorrect matchings
  • stage-2. refines the matching results with a latent co-attention mechanism
    • spatial attention. relates each word to its corresponding image regions
    • latent semantic attention. aligns different sentence structures so that matching is more robust to sentence structure variation; at each step of the decoder LSTM, it learns how to weight different words’ features


1.2. Contribution

  • identity-aware two-stage framework
  • CMCE loss
  • latent co-attention mechanism

1.3. Related Work

1.3.1. Visual Matching with Identity-Level Annotation

  • person re-identification
  • face recognition
    • classify all identities simultaneously. faces challenges when the number of classes is too large
    • pair-wise or triplet distance loss functions. hard negative training samples become difficult to sample as the number of training samples increases

1.3.2. Textual-Visual Matching

  • image captioning
  • VQA
  • text-image embedding



2. Methods


2.1. Stage-1 with CMCE Loss



  • map the image and the description into a joint feature embedding space (a sketch follows this list)
  • Cross-Modal Cross-Entropy (CMCE) loss to minimize intra-identity and maximize inter-identity feature distances
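A minimal sketch of this joint embedding, assuming a generic image CNN and a word-level LSTM whose outputs are projected into a shared space (the module names, dimensions, and the `out_dim` attribute below are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class Stage1Embedding(nn.Module):
    """Sketch: map an image and a description into a joint feature space."""
    def __init__(self, cnn_backbone, vocab_size, embed_dim=512, joint_dim=512):
        super().__init__()
        self.cnn = cnn_backbone                         # any image CNN returning a feature vector
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, joint_dim, batch_first=True)
        # out_dim: assumed attribute giving the backbone's feature size
        self.img_proj = nn.Linear(cnn_backbone.out_dim, joint_dim)

    def forward(self, image, tokens):
        v = self.img_proj(self.cnn(image))              # visual feature in the joint space
        _, (h, _) = self.lstm(self.word_embed(tokens))  # the whole sentence is compressed
        s = h[-1]                                       # into the final hidden state
        return v, s                                     # compared via inner products in CMCE
```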

2.1.1. Cross-Modal Cross-Entropy Loss

  • pair-wise or triplet loss. N identities yield O(N^2) training pairs, so hard negative samples are difficult to sample
  • CMCE. compares each sampled identity in the mini-batch from one modality against all N identities in the other modality, covering all hard negative samples

  • cross-modal affinity. inner products of features from the two modalities

  • textual and visual feature buffers. enable efficient calculation of textual-visual affinities
  • before the first iteration. if an identity has multiple descriptions or images, its stored features in the buffers are the average of the multiple samples
  • in each iteration. compute the loss, back-propagate, then update the corresponding buffer rows of the sampled identities; if identity t has multiple images or descriptions, its buffer row is updated accordingly


  • affinity between one image feature v and ith textual feature S_i.
  • σ. temperature hyper-parameter that controls how peaky the probability distribution is



  • affinity between one textual feature s and kth image feature V_k.
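A plausible reconstruction of these two probabilities, assuming inner-product affinities normalized by a softmax with temperature σ over all N buffered identities:

```latex
p(i \mid v) = \frac{\exp(v^{\top} S_i / \sigma)}{\sum_{j=1}^{N} \exp(v^{\top} S_j / \sigma)},
\qquad
p(k \mid s) = \frac{\exp(s^{\top} V_k / \sigma)}{\sum_{j=1}^{N} \exp(s^{\top} V_j / \sigma)}
```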



  • maximize the probability of corresponding identity pairs
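Writing t_v and t_s for the ground-truth identities of a sampled image feature v and textual feature s, the CMCE objective can be read as maximizing the log-probabilities of the corresponding identities in both matching directions (a reconstruction consistent with the descriptions above):

```latex
L_{\mathrm{CMCE}} = - \sum_{v \in \text{batch}} \log p(t_v \mid v) \;-\; \sum_{s \in \text{batch}} \log p(t_s \mid s)
```

A minimal PyTorch-style sketch of this computation, assuming per-identity feature buffers `S_buf` and `V_buf` (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def cmce_loss(img_feat, txt_feat, identity, S_buf, V_buf, sigma=0.04):
    """Compare each mini-batch feature against ALL N buffered identities
    of the other modality, so hard negatives are always covered.

    img_feat: (B, D) visual features of the sampled identities
    txt_feat: (B, D) textual features of the sampled identities
    identity: (B,)   ground-truth identity indices in [0, N)
    S_buf:    (N, D) textual feature buffer, one row per identity
    V_buf:    (N, D) visual feature buffer,  one row per identity
    """
    # inner-product affinities against every buffered identity,
    # sharpened by the temperature sigma
    img_logits = img_feat @ S_buf.t() / sigma   # (B, N)
    txt_logits = txt_feat @ V_buf.t() / sigma   # (B, N)

    # cross-entropy = negative log-probability of the true identity,
    # accumulated over both matching directions
    loss = F.cross_entropy(img_logits, identity) + \
           F.cross_entropy(txt_logits, identity)
    # (after back-propagation, the buffer rows of the sampled identities
    #  are refreshed following the paper's update rule)
    return loss
```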

2.2. Stage-2 with Latent Co-attention

  • stage-1. the learned visual and textual feature embeddings might not be optimal, since the whole sentence is compressed into a single feature vector
  • stage-1. the matching is sensitive to sentence structure variation


  • input. a pair consisting of a text description and an image
  • output. matching confidence
  • trained stage-1 network serves as the initial point for the stage-2 network
  • only hard negative matching samples from stage-1 results are utilized for training stage-2

2.2.1. Encoder Word-LSTM with Spatial Attention

  • generate attention weights between each word and the L image regions
  • compute the weighted sum of all regions
  • concatenate the word feature and the weighted sum


  • word features from the LSTM. H = {h_1, …, h_T}, H ∈ R^(D_H × T)
  • image features. I = {i_1, …, i_L}, I ∈ R^(D_I × L)
  • D_H. dimension of the hidden state
  • D_I. dimension of an image region feature
  • T. number of words
  • L. number of image regions
  • W_I, W_H. transform the features into a K-dimensional space
  • W_P. converts the features into affinity scores


  • weighted sum of the L regions for each word


  • concatenate the word feature and its weighted region feature to form the image-word feature x_t
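Putting the three steps together, one way to write them is the following (the tanh nonlinearity is an assumption; the notes only state that W_I and W_H map into a K-dimensional space and W_P produces an affinity score):

```latex
a_{t,l} = \frac{\exp\big(W_P \tanh(W_I i_l + W_H h_t)\big)}
               {\sum_{l'=1}^{L} \exp\big(W_P \tanh(W_I i_{l'} + W_H h_t)\big)},
\qquad
\hat{i}_t = \sum_{l=1}^{L} a_{t,l}\, i_l,
\qquad
x_t = \big[h_t ;\ \hat{i}_t\big]
```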

2.2.2. Decoder LSTM with Latent Semantic Attention

  1. generate attention weights between the previous decoder hidden state and all x_t
  2. compute the weighted sum of the x_t
  3. transform the result and feed it into the next LSTM step
  • a plain LSTM is not robust to sentence structure variations
  • the decoder LSTM with latent semantic attention automatically aligns sentence structures
  • the M-step decoder LSTM processes the encoded features step by step while searching through the entire input sentence to align the image-word features x_t; at the m-th step:


  • f. a two-layer CNN that weights the importance of the j-th word for the m-th decoding step
  • c_{m-1}. hidden state of the decoder LSTM at step m-1



  • x_m. transformed by two FC layers before being fed into the decoder LSTM at step m
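With f scoring the relevance of the j-th image-word feature x_j to the previous decoder state c_{m-1}, the m-th decoding step can be written as follows (a reconstruction from the descriptions above, assuming a softmax normalization of the scores; \hat{x}_m denotes the attended feature that is then transformed by the two FC layers):

```latex
\beta_{m,j} = \frac{\exp\big(f(x_j, c_{m-1})\big)}{\sum_{j'=1}^{T} \exp\big(f(x_{j'}, c_{m-1})\big)},
\qquad
\hat{x}_m = \sum_{j=1}^{T} \beta_{m,j}\, x_j
```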

  • by re-weighting the source image-word features, the LSTM is able to focus more on relevant information, enhancing the network’s robustness to sentence structure variation


  • easier training samples are filtered out by the stage-1 network
  • N’. the number of training samples for stage-2
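Since stage-2 outputs a matching confidence p_n for each of the N’ screened text-image pairs, a natural objective (assumed here: a standard binary cross-entropy with y_n = 1 for a correct pair) would be:

```latex
L = -\frac{1}{N'} \sum_{n=1}^{N'} \Big[\, y_n \log p_n + (1 - y_n) \log (1 - p_n) \,\Big]
```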



3. Experiments


3.1. Dataset

  • CUHK-PEDES. two descriptions per image
  • Caltech-UCSD Birds (CUB). ten descriptions per image
  • Oxford-102 Flowers. ten descriptions per image

3.2. Details

  • σ = 0.04
  • Adam for the LSTM, SGD for the CNN
  • training and testing samples are screened by the matching results of stage-1
  • for each visual or textual sample, take its 20 most similar samples from the other modality according to stage-1 and construct textual-visual pairs for stage-2 training and testing (see the sketch below)
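A small sketch of this screening step, assuming a stage-1 affinity matrix between all images and all descriptions (names are illustrative):

```python
import torch

def screen_top_k(affinity, k=20):
    """affinity: (num_images, num_texts) stage-1 inner-product scores.
    For every image, keep the indices of its k most similar descriptions,
    and symmetrically for every description."""
    top_texts_per_image = affinity.topk(k, dim=1).indices       # (num_images, k)
    top_images_per_text = affinity.topk(k, dim=0).indices.t()   # (num_texts, k)
    return top_texts_per_image, top_images_per_text
```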

3.3. Comparison




3.4. Ablation Study



  • CMCE loss vs triplet loss. the triplet loss needs about 3× more training time than CMCE
  • identity number vs counting number.
  • latent semantic attention vs removing it. aligns visual and semantic concepts, mitigating sensitivity to sentence structure
  • spatial attention vs simply concatenating visual and textual features.
  • stage-1 vs w/o stage-1